182 research outputs found

    QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

    Get PDF
    Previous studies have reported that common dense linear algebra operations do not achieve speed up by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization -- one of the main dense linear algebra kernels -- of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to articulate a recently proposed algorithm (Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's).Comment: Accepted at IPDPS10. (IEEE International Parallel & Distributed Processing Symposium 2010 in Atlanta, GA, USA.

    Analyzing the effect of local rounding error propagation on the maximal attainable accuracy of the pipelined Conjugate Gradient method

    Get PDF
    Pipelined Krylov subspace methods typically offer improved strong scaling on parallel HPC hardware compared to standard Krylov subspace methods for large and sparse linear systems. In pipelined methods the traditional synchronization bottleneck is mitigated by overlapping time-consuming global communications with useful computations. However, to achieve this communication hiding strategy, pipelined methods introduce additional recurrence relations for a number of auxiliary variables that are required to update the approximate solution. This paper aims at studying the influence of local rounding errors that are introduced by the additional recurrences in the pipelined Conjugate Gradient method. Specifically, we analyze the impact of local round-off effects on the attainable accuracy of the pipelined CG algorithm and compare to the traditional CG method. Furthermore, we estimate the gap between the true residual and the recursively computed residual used in the algorithm. Based on this estimate we suggest an automated residual replacement strategy to reduce the loss of attainable accuracy on the final iterative solution. The resulting pipelined CG method with residual replacement improves the maximal attainable accuracy of pipelined CG, while maintaining the efficient parallel performance of the pipelined method. This conclusion is substantiated by numerical results for a variety of benchmark problems.Comment: 26 pages, 6 figures, 2 tables, 4 algorithm

    Méthodes directes hors-mémoire (out-of-core) pour la résolution de systÚmes linéaires creux de grande taille

    Get PDF
    Factorizing a sparse matrix is a robust way to solve large sparse systems of linear equations. However such an approach is known to be costly both in terms of computation and storage. When the storage required to process a matrix is greater than the amount of memory available on the platform, so-called out-of-core approaches have to be employed: disks extend the main memory to provide enough storage capacity. In this thesis, we investigate both theoretical and practical aspects of such out-of-core factorizations. The MUMPS and SuperLU software packages are used to illustrate our discussions on real-life matrices. First, we propose and study various out-of-core models that aim at limiting the overhead due to data transfers between memory and disks on uniprocessor machines. To do so, we revisit the algorithms to schedule the operations of the factorization and propose new memory management schemes to fit out-of-core constraints. Then we focus on a particular factorization method, the multifrontal method, that we push as far as possible in a parallel out-of-core context with a pragmatic approach. We show that out-of-core techniques allow to solve large sparse linear systems efficiently. When only the factors are stored on disks, a particular attention must be paid to temporary data, which remain in core memory. To achieve a high scalability of core memory usage, we rethink the whole schedule of the out-of-core parallel factorization.La factorisation d'une matrice creuse est une approche robuste pour la rĂ©solution de systĂšmes linĂ©aires creux de grande taille. NĂ©anmoins, une telle factorisation est connue pour ĂȘtre coĂ»teuse aussi bien en temps de calcul qu'en occupation mĂ©moire. Quand l'espace mĂ©moire nĂ©cessaire au traitement d'une matrice est plus grand que la quantitĂ© de mĂ©moire disponible sur la plate-forme utilisĂ©e, des approches dites hors-mĂ©moire (out-of-core) doivent ĂȘtre employĂ©es : les disques Ă©tendent la mĂ©moire centrale pour fournir une capacitĂ© de stockage suffisante. Dans cette thĂšse, nous nous intĂ©ressons Ă  la fois aux aspects thĂ©oriques et pratiques de telles factorisations hors-mĂ©moire. Les environnements logiciel MUMPS et SuperLU sont utilisĂ©s pour illustrer nos discussions sur des matrices issues du monde industriel et acadĂ©mique. Tout d'abord, nous proposons et Ă©tudions dans un cadre sĂ©quentiel diffĂ©rents modĂšles hors-mĂ©moire qui ont pour but de limiter le surcoĂ»t dĂ» aux transferts de donnĂ©es entre la mĂ©moire et les disques. Pour ce faire, nous revisitons les algorithmes qui ordonnancent les opĂ©rations de la factorisation et proposons de nouveaux schĂ©mas de gestion mĂ©moire s'accommodant aux contraintes hors-mĂ©moire. Ensuite, nous nous focalisons sur une mĂ©thode de factorisation particuliĂšre, la mĂ©thode multifrontale, que nous poussons aussi loin que possible dans un contexte parallĂšle hors-mĂ©moire. Suivant une dĂ©marche pragmatique, nous montrons que les techniques hors-mĂ©moire permettent de rĂ©soudre efficacement des systĂšmes linĂ©aires creux de grande taille. Quand seuls les facteurs sont stockĂ©s sur disque, une attention particuliĂšre doit ĂȘtre portĂ©e aux donnĂ©es temporaires, qui restent en mĂ©moire centrale. Pour faire dĂ©croĂźtre efficacement l'occupation mĂ©moire associĂ©e Ă  ces donnĂ©es temporaires avec le nombre de processeurs, nous repensons l'ordonnancement de la factorisation parallĂšle hors-mĂ©moire dans son ensemble

    Pipelining the Fast Multipole Method over a Runtime System

    Get PDF
    Fast Multipole Methods (FMM) are a fundamental operation for the simulation of many physical problems. The high performance design of such methods usually requires to carefully tune the algorithm for both the targeted physics and the hardware. In this paper, we propose a new approach that achieves high performance across architectures. Our method consists of expressing the FMM algorithm as a task flow and employing a state-of-the-art runtime system, StarPU, in order to process the tasks on the different processing units. We carefully design the task flow, the mathematical operators, their Central Processing Unit (CPU) and Graphics Processing Unit (GPU) implementations, as well as scheduling schemes. We compute potentials and forces of 200 million particles in 48.7 seconds on a homogeneous 160 cores SGI Altix UV 100 and of 38 million particles in 13.34 seconds on a heterogeneous 12 cores Intel Nehalem processor enhanced with 3 Nvidia M2090 Fermi GPUs.Comment: No. RR-7981 (2012

    Implementing multifrontal sparse solvers for multicore architectures with Sequential Task Flow runtime systems

    Get PDF
    International audienceTo face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This paper evaluates the usability and effectiveness of runtime systems based on the Sequential Task Flow model for complex applications , namely, sparse matrix multifrontal factorizations which feature extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption. Most importantly, it shows how this parallel programming model eases the development of complex features that benefit the performance of sparse, direct solvers as well as their memory consumption. We illustrate our discussion with the multifrontal QR factorization running on top of the StarPU runtime system. ACM Reference Format: Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche and Florent Lopez, 2014. Implementing multifrontal sparse solvers for multicore architectures with Sequential Task Flow runtime system

    Multifrontal QR Factorization for Multicore Architectures over Runtime Systems

    Get PDF
    International audienceTo face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This paper evaluates the usability of runtime systems for complex applications, namely, sparse matrix multifrontal factorizations which constitute extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption. Experimental results on real-life matrices show that it is possible to achieve the same efficiency as with an ad hoc scheduler which relies on the knowledge of the algorithm. A detailed analysis shows the performance behavior of the resulting code and possible ways of improving the effectiveness of runtime systems

    Exploiting a Parametrized Task Graph model for the parallelization of a sparse direct multifrontal solver

    Get PDF
    International audienceThe advent of multicore processors requires to reconsider the design of high performance computing libraries to embrace portable and effective techniques of parallel software engineering. One of the most promising approaches consists in abstracting an application as a directed acyclic graph (DAG) of tasks. While this approach has been popularized for shared memory environments by the OpenMP 4.0 standard where dependencies between tasks are automatically inferred, we investigate an alternative approach, capable of describing the DAG of task in a distributed setting, where task dependencies are explicitly encoded. So far this approach has been mostly used in the case of algorithms with a regular data access pattern and we show in this study that it can be efficiently applied to a higly irregular numerical algorithm such as a sparse multifrontal QR method. We present the resulting implementation and discuss the potential and limits of this approach in terms of productivity and effectiveness in comparison with more common parallelization techniques. Although at an early stage of development, preliminary results show the potential of the parallel programming model that we investigate in this work

    Resiliency in numerical algorithm design for extreme scale simulations

    Get PDF
    This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 1023 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.Peer Reviewed"Article signat per 36 autors/es: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik G ̈oddeke, Marco Heisig, Fabienne Jezequel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortiz, Francesco Rizzi, Ulrich Rude, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thonnes, Andreas Wagner and Barbara Wohlmuth"Postprint (author's final draft

    Comparison of coupled solvers for FEM/BEM linear systems arising from discretization of aeroacoustic problems

    Get PDF
    National audienceWhen discretization of an aeroacoustic physical model is based on the application of both the Finite Elements Method (FEM) and the Boundary Elements Method (BEM), this leads to coupled FEM/BEM linear systems combining sparse and dense parts. In this work, we propose and compare a set of implementation schemes relying on the coupling of the open-source sparse direct solver MUMPS with the proprietary direct solvers from Airbus Central R&T, i.e. the scalapack-like dense solver SPIDO and the hierarchical H-matrix compressed solver HMAT. For this preliminary study, we limit ourselves to a single 24-core computational node

    On the resilience of a parallel sparse hybrid solver

    Get PDF
    As the computational power of high performance computing (HPC) systemscontinues to increase by using a huge number of CPU cores orspecialized processing units, extreme-scale applications areincreasingly prone to faults. As a consequence, the HPC community hasproposed many contributions to design resilient HPC applications, maythese contributions be system-oriented, theoretical or numerical. Inthis study we consider an actual fully-featured parallel sparse hybrid(direct/iterative) linear solver, \maphys, and we propose numericalremedies to design a resilient version of the solver. The solver beinghybrid, we focus in this study on the iterative solution step, whichis often the dominant step in practice. We furthermore assume that aseparate mechanism ensures fault detection and that a system layerprovides support for setting back the environment (processes, \ldots)in a running state. The present manuscript therefore focuses on (andonly on) strategies for recovering lost data \emph{after} the faulthas been detected (a separate concern out of the scope of this study)and \emph{once} the system is back in a running state (anotherseparate concern not studied here either). The numerical remedies wepropose are twofold. Whenever possible, we exploit the natural data redundancybetween processes from the solver to perform exact recovery throughclever copies over processes. Otherwise, data that has been lost andis not available anymore on any process is recovered through aso-called interpolation-restart mechanism. This mechanism is derivedfrom~\cite{aggr:13} to carefully take into account the properties ofthe target hybrid solver. These numerical remedies have beenimplemented in the \maphys parallel solver so that we can assess theirefficiency on a large number of processing units (up to 12,28812,288 CPUcores) for solving large-scale real-life problems
    • 

    corecore